\documentstyle{report}     % Specifies the document style.

\pagestyle{headings}

\title{\bf
Extensions to Common LISP to Support International
Character Sets}
\author{
Michael Beckerle\thanks{Gold Hill Computers} \and
Paul Beiser\thanks{Hewlett-Packard} \and
Jerry Duggan\thanks{Hewlett-Packard} \and
Robert Kerns\thanks{Independent consultant} \and
Kevin Layer\thanks{Franz, Inc.} \and
Thom Linden\thanks{IBM Research, Subcommittee Chair} \and
Larry Masinter\thanks{Xerox Research} \and
David Unietis\thanks{Lucid, Inc.}
}
\date{March 20, 1989} % Deleting this command produces today's date.

\begin{document}

\maketitle                 % Produces the title.

\setcounter{secnumdepth}{4}

\setcounter{tocdepth}{4}
\tableofcontents


%----------------------------------------------------------------------
%----------------------------------------------------------------------
\newtheorem{prop}{Proposal}[section]
\newfont{\cltxt}{cmr10}
\newfont{\clkwd}{cmtt10}

\newcommand{\apostrophe}{\clkwd '}
\newcommand{\bq}{\clkwd\symbol{'22}}


%----------------------------------------------------------------------
%----------------------------------------------------------------------
\chapter{Introduction}

This is a proposal to the X3 J13 committee
for both extending and modifying the Common LISP
language definition to provide a standard basis for Common LISP
support of the variety of characters used to represent the
native languages of the international community.

This proposal was created by the Character Subcommittee of X3 J13.
We would like to acknowledge discussions with T. Yuasa and other
members of the JIS Technical Working Group,
comments from members of X3 J13,
and the proposals \cite{ida87},
\cite{linden87}, \cite{kerns87}, and \cite{kurokawa88} for
providing the motivation and direction for these extensions.
As all these documents and discussions were created
expressly for LISP standardization usage,
we have borrowed freely from their ideas as well as the texts
themselves.


\section{Objectives}

The major objectives of this proposal are:
\begin{itemize}
\item To provide a consistent, well-defined scheme allowing support
of both very large character sets and multiple character sets.
\footnote{The distinction between the terms {\em character repertoire}
and {\em coded character set} is made later.  The usage
of the term {\em character set},
avoided after this introduction, encompasses both terms.}

Many software applications are intended for international use, or
have requirements for incorporation of language elements of multiple
native languages within a single application.
Also, many applications require specialized languages including,
for example, scientific and typesetting symbols.
In order
to ensure some portability of these applications, data expressed in
a mixture of these
languages must be treated uniformly by the
software language.

All character and string manipulations should operate uniformly,
regardless of the character set(s) of the character objects.
This applies to array indexing, readtable definitions, read
symbol construction and I/O operations.


\item To ensure efficient performance of string and character
operations.

Many native
languages, such as Japanese and Chinese, use character
sets which contain more characters than the Latin alphabet.
Supporting larger sized character sets frequently means employing
larger data fields to uniquely encode each character.
Common LISP implementations using
larger sized character sets can
incur performance penalties in terms
of space, time, or both.

The use of large and/or multiple character sets by an
implementation
implies the need for a more complex character type representation.
Given a more complex character representation, the efficiency
of language operations on characters (e.g. string operations)
could be affected.

\item To assure forward compatibility of the proposed model
and definition with existing Common LISP implementations.

Developers should not be required to re-write large amounts of either
LISP code or data representations in order to apply the proposed
changes to existing implementations.
The proposed changes should provide an easy
portability path for existing code to many possible implementations.
\end{itemize}

There are a number of issues, some under the general rubric of
internationalization, which this proposal does {\em not} cover.
Among these issues are:
\begin{itemize}
\item Time and date formats
\item Monetary formats
\item Numeric punctuation
\item Fonts
\item Lexicographic orderings
\item Right-to-left and bidirectional languages
\end{itemize}

%----------------------------------------------------------------------
%----------------------------------------------------------------------
%----------------------------------------------------------------------
%----------------------------------------------------------------------
\chapter{Overview}

We use several terms within this document which
are new in the context of Common LISP.
Definitions for the following prominent
terms are provided for the reader's convenience.

A {\em character repertoire} defines a collection of characters
independent of their specific rendered image or font.  This
corresponds to the mathematical notion of a {\em set}
\footnote{We avoid the term {\em character set} as it has been
(over)used in the context of character repertoire as well
as in the context of coded character set.}.
Character
repertoires are specified independent of coding and their characters
are only identified with a unique {\em character label},
a graphic symbol, and
a character description.

A {\em coded character set} is a character repertoire plus
an {\em encoding} providing a unique mapping between each character
and a number which serves as the character representation.
There are numerous internationally standardized coded character
sets; for example, \cite{iso8859/1} and \cite{iso646}.

A character may be included in one or more character repertoires.
Similarly, a character may be included in one or more
coded character sets.  For example, the Latin letter ``A'' is contained
in the coded character set standards: ISO 8859/1, ISO 8859/2,
ISO 6937/2, and others.

To universally identify each character, we define a unique
collection of repertoires called {\em character
registries} as a partitioning of all characters.
That is, each character is included
in one and only one character registry.

In Common LISP a {\em character} data object is identified by its
{\em character code}, a unique numerical code.
Each character code is composed from
a character registry and a character label.

Character data objects which are classified as {\em graphic},
or displayable, are each associated with a {\em glyph}.  The
glyph is the visual representation of the character.
Character data objects which are not graphic are classified
as {\em control}.


The primary purpose of introducing these terms is to provide
consistent names for Common LISP concepts which are related
to those found in ISO standardization of coded
character sets.
\footnote{The bibliography includes several relevant ISO
coded character set standards.}
They also serve as a demarcation between these
standardization activities.  For example, while Common LISP is free to
define unique manipulation facilities for characters, registries
and coded character sets, it should
not define standard coded character sets nor standard character
registries.

A secondary purpose is to detach the language specification from
underlying hardware representation.  From a language
specification viewpoint it is inconsequential whether
characters occupy one or more (8-bit) bytes or whether
a Common LISP implementation's
internal representation for characters is distinct from or identical
to any of the numerous
external representations (for example, the text interchange
representation \cite{iso6937/2}).
We specifically do not propose any standard coded character sets.

A final purpose is to serve as a basis for terminology within the
standard language specification.

\begin{prop}
The terminology introduced in this proposal will be included
in the language specification at the discretion of the editor.
\end{prop}


%----------------------------------------------------------------------
\section{Character Identity}

Characters are uniquely distinguished by their codes,
which are drawn from the set of
non-negative integers.  That is, within Common LISP
a unique numerical code
is assigned to each semantically different character.

It is important to separate the notion of glyph from the notion of
character data object when defining a scheme under which issues of
identity can be rigorously decided by a computer language.  Glyphs are
the visual aspects of characters, writable on surfaces, and sometimes
called `graphics'.  A language specification valid for more than a
narrow range of systems can only make assumptions about the existence
of {\em abstract} glyphs (for example, the Latin letter A) and not about
glyph variants (for example, the italicized Latin letter {\em A})
or characteristics of display devices.

The notion of attributes of character
objects within Common LISP has proven to be either not used or
not portable.  The essential question addressed by the following
proposals is the extent to which attributes continue to be supported
by the language specification.

\begin{prop}[Alternative A]
 Remove all discussion of attributes from
 the language specification.  Add the following discussion:
\begin{quote}
Earlier versions of Common LISP incorporated {\em font} and
{\em bits} as attributes of character objects.  These and other
supported attributes are considered implementation-defined
attributes and, if supported by an implementation, affect the
action of selected functions.
\end{quote}
 All types, constants and functions
 dealing with the {\em bits} and {\em font} attributes are either
 removed or modified as follows:
\begin{itemize}
\item Modify {\clkwd char}$=$: If two characters differ in any
implementation-defined attributes, then they are not {\clkwd char}$=$.
\item Modify {\clkwd char}$<$: If two characters have identical
  implementation-defined attributes, then their ordering by
  {\clkwd char}$<$ is consistent with the numerical ordering by the
  predicate $<$ on
  their codes. (Similarly for {\clkwd char}$>$,
  {\clkwd char}$>=$ and {\clkwd char}$<=$.)
\item Modify {\clkwd char-equal}:
The effect, if any, on {\clkwd char-equal} of each
  implementation-defined attribute has to be specified as part of
  the definition of that attribute (and similarly for
  {\clkwd char-not-equal, char-lessp, char-greaterp,
   char-not-greaterp, char-not-lessp}).
\item Modify {\clkwd char-upcase} and {\clkwd char-downcase}:
The effect of {\clkwd char-upcase} and {\clkwd char-downcase}
  is to preserve implementation-defined attributes.
\item  Modify {\clkwd read}: It is implementation dependent which
  attributes are removed from symbol names.
  It is implementation dependent which attributes are removed
  from characters within double quotes.
\item  Modify {\clkwd intern}: It is implementation dependent,
but consistent with the {\clkwd read} function,
which implementation-defined attributes are removed.
\item  Modify {\clkwd digit-char}: remove the optional {\em font}
argument.
\item  Modify {\clkwd code-char}: remove the optional {\em font}
and {\em bits} arguments.
\item Remove {\clkwd char-font-limit}
\item Remove {\clkwd char-bits-limit}
\item Remove {\clkwd int-char}
\item Remove {\clkwd char-int}
\item Remove {\clkwd char-bits}
\item Remove {\clkwd char-font}
\item Remove {\clkwd make-char}
\item Remove {\clkwd char-control-bit}
\item Remove {\clkwd char-meta-bit}
\item Remove {\clkwd char-super-bit}
\item Remove {\clkwd char-hyper-bit}
\item Remove {\clkwd char-bit}
\item Remove {\clkwd set-char-bit}
\item Redefine {\clkwd string-char} as implementation defined
as either {\clkwd base-character} or {\clkwd character}.
\item Modify readtable: If implementation-defined attributes
are supported, an implementation need not (but may) allow
for such characters to have syntax descriptions in the readtable.
Otherwise, all characters are representable in the readtable.
\end{itemize}
\end{prop}
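
As an illustration only, the ordering requirement on {\clkwd char}$<$
can be written as a small check.  The sketch assumes an implementation
that retains {\clkwd char-code} (as under Alternative A) and characters
that carry no implementation-defined attributes.  The function name
{\clkwd code-order-consistent-p} is hypothetical and is not proposed.

\begin{verbatim}
;; True when the ordering of C1 and C2 by CHAR< agrees with the
;; numerical ordering of their codes by <.
(defun code-order-consistent-p (c1 c2)
  (eq (not (char< c1 c2))
      (not (< (char-code c1) (char-code c2)))))
\end{verbatim}

A conforming implementation would be expected to return true for any
pair of such characters, for example for {\clkwd a} and {\clkwd b} in
either argument order.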

\begin{prop}[Alternative B]
 This is identical to all of Alternative A (above) except that
 the function {\clkwd char-int} is retained for hashing purposes.
 {\clkwd char-int} returns a non-negative integer encoding the
 character object.  The manner in which the integer is computed
 is implementation dependent. In contrast to {\clkwd sxhash},
 the result is not guaranteed independent of the particular
 ``incarnation'' or ``core image''.
\end{prop}

With the elimination of {\em font} and {\em bits} from the
specification the usefulness of {\clkwd char-code} and {\clkwd
code-char} is diminished.  They are no longer needed for constructing
characters.
The portable mechanisms for hashing are provided by
{\clkwd char-int} and {\clkwd sxhash}.

In addition, using {\clkwd char-code-limit} to iterate over
characters is extremely inefficient in implementations that
support large or user-defined repertoires.

\begin{prop}[Alternative C]
 This is an amendment to Alternative B (above).
\begin{itemize}
\item Remove {\clkwd char-code-limit}
\item Remove {\clkwd char-code}
\item Remove {\clkwd code-char}
\end{itemize}
\end{prop}

%----------------------------------------------------------------------
\section{Standard and Semi-Standard Characters}

The standard characters are the 96 characters used in the Common LISP
definition {\bf or their equivalents}.

This was the Common LISP \cite{steele84} definition, but
{\em equivalents} is a vague term.

The standard characters are not defined by their glyphs, but by their
roles within the language.  There are two aspects to the roles of the
standard characters: one is their role in reader and format control
string syntax; the second is their role as components of the names of
all Common LISP
functions, macros, constants, and global variables.  As
long as an implementation chooses 96 glyphs
and treats those 96 in a manner consistent with
the language's specification for the standard characters (e.g.
the naming of functions), it doesn't matter what glyphs the I/O
hardware uses to represent those characters: they are the standard
characters.  Any program or
data text written wholly in those characters
is portable through simple code conversion.
\footnote{For example, the currency glyph, \$ , might be replaced
uniformly by the currency glyph available on a particular display.}

Additional mechanisms,
such as in \cite{kurokawa88}, which support establishment of
equivalency between otherwise distinct characters are not excluded by
this proposal.
\footnote{We believe this is an important issue but it requires
additional implementation experience.  We also encourage
new proposals from JIS and ISO LISP Working Groups on this issue.}


\begin{prop}
The discussion of standard characters is
replaced by the following:

  Common LISP requires all implementations to support a {\em standard}
  character subrepertoire.
  The Common LISP
  standard character subrepertoire consists of
  a newline \#$\backslash${\clkwd Newline}, the
  graphic space character \#$\backslash${\clkwd Space},
  and the following additional
  ninety-four graphic characters or their equivalents:
\footnote{\cltxt \#$\backslash${\clkwd Space}
and \#$\backslash${\clkwd Newline} are omitted.
Graphic labels and descriptions are from ISO 6937/2.
The first letter of the graphic Id categorizes the
character as follows: L - Latin, N - Numeric, S - Special
.}

{\small \begin{tabular}{||l|c|l||l|c|l||}    \hline
  Id     &    Glyph    &  Name or description
& Id     &    Glyph    &  Name or description
\\ \hline
  LA01  &  a  &  small a
& ND01  &  1  &  digit 1
\\ \hline
  LA02  &  A  &  capital A
& ND02  &  2  &  digit 2
\\ \hline
  LB01  &  b  &  small b
& ND03  &  3  &  digit 3
\\ \hline
  LB02  &  B  &  capital B
& ND04  &  4  &  digit 4
\\ \hline
  LC01  &  c  &  small c
& ND05  &  5  &  digit 5
\\ \hline
  LC02  &  C  &  capital C
& ND06  &  6  &  digit 6
\\ \hline
  LD01  &  d  &  small d
& ND07  &  7  &  digit 7
\\ \hline
  LD02  &  D  &  capital D
& ND08  &  8  &  digit 8
\\ \hline
  LE01  &  e  &  small e
& ND09  &  9  &  digit 9
\\ \hline
  LE02  &  E  &  capital E
& ND10  &  0  &  digit 0
\\ \hline
  LF01  &  f  &  small f
& SC03  &  \$    &  dollar sign
\\ \hline
  LF02  &  F  &  capital F
& SP02  &  !     &  exclamation mark
\\ \hline
  LG01  &  g  &  small g
& SP04  &  "     &  quotation mark
\\ \hline
  LG02  &  G  &  capital G
& SP05  &  \apostrophe     &  apostrophe
\\ \hline
  LH01  &  h  &  small h
& SP06  &  (     &  left parenthesis
\\ \hline
  LH02  &  H  &  capital H
& SP07  &  )     &  right parenthesis
\\ \hline
  LI01  &  i  &  small i
& SP08  &  ,     &  comma
\\ \hline
  LI02  &  I  &  capital I
& SP09  &  \_    &  low line
\\ \hline
  LJ01  &  j  &  small j
& SP10  &  -     &  hyphen or minus sign
\\ \hline
  LJ02  &  J  &  capital J
& SP11  &  .     &  full stop, period
\\ \hline
  LK01  &  k  &  small k
& SP12  &  /     &  solidus
\\ \hline
  LK02  &  K  &  capital K
& SP13  &  :     &  colon
\\ \hline
  LL01  &  l  &  small l
& SP14  &  ;     &  semicolon
\\ \hline
  LL02  &  L  &  capital L
& SP15  &  ?     &  question mark
\\ \hline
  LM01  &  m  &  small m
& SA01  &  +     &  plus sign
\\ \hline
  LM02  &  M  &  capital M
& SA03  &  $<$   &  less-than sign
\\ \hline
  LN01  &  n  &  small n
& SA04  &  =   &  equals sign
\\ \hline
  LN02  &  N  &  capital N
& SA05  &  $>$   &  greater-than sign
\\ \hline
  LO01  &  o  &  small o
& SM01  &  \#    &  number sign
\\ \hline
  LO02  &  O  &  capital O
& SM02  &  \%    &  percent sign
\\ \hline
  LP01  &  p  &  small p
& SM03  &  \&    &  ampersand
\\ \hline
  LP02  &  P  &  capital P
& SM04  &  *     &  asterisk
\\ \hline
  LQ01  &  q  &  small q
& SM05  &  @     &  commercial at
\\ \hline
  LQ02  &  Q  &  capital Q
& SM06  &  [     &  left square bracket
\\ \hline
  LR01  &  r  &  small r
& SM07  &  $\backslash$   &  reverse solidus
\\ \hline
  LR02  &  R  &  capital R
& SM08  &  ]     &  right square bracket
\\ \hline
  LS01  &  s  &  small s
& SM11  &  \{    &  left curly bracket
\\ \hline
  LS02  &  S  &  capital S
& SM13  &  $|$     &  vertical bar
\\ \hline
  LT01  &  t  &  small t
& SM14  &  \}    &  right curly bracket
\\ \hline
  LT02  &  T  &  capital T
& SD13  &  \bq   &  grave accent
\\ \hline
  LU01  &  u  &  small u
& SD15  &  $\hat{ }$  &  circumflex accent
\\ \hline
  LU02  &  U  &  capital U
& SD19  &  $\tilde{ }$ &  tilde
\\ \hline
  LV01  &  v  &  small v
& & &
\\ \hline
  LV02  &  V  &  capital V
& & &
\\ \hline
  LW01  &  w  &  small w
& & &
\\ \hline
  LW02  &  W  &  capital W
& & &
\\ \hline
  LX01  &  x  &  small x
& & &
\\ \hline
  LX02  &  X  &  capital X
& & &
\\ \hline
  LY01  &  y  &  small y
& & &
\\ \hline
  LY02  &  Y  &  capital Y
& & &
\\ \hline
  LZ01  &  z  &  small z
& & &
\\ \hline
  LZ02  &  Z  &  capital Z
& & &
\\
\hline
\end{tabular} }

\end{prop}
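
The earlier remark that text written wholly in standard characters is
portable through simple code conversion can be made concrete.  The
following sketch, given for illustration only, tests a text for that
property using the existing predicate {\clkwd standard-char-p}.  The
name {\clkwd portable-text-p} is hypothetical and is not proposed.

\begin{verbatim}
;; True when every character of TEXT is a standard character, so
;; that TEXT is portable through simple code conversion.
(defun portable-text-p (text)
  (every #'standard-char-p text))
\end{verbatim}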

The definition of semi-standard characters has been of minimal
practical use since implementations may or may not support any
of these characters.  The essential feature is that, when
supported, they have a predictable treatment by the reader.

\begin{prop}
Remove all discussion of semi-standard characters.
Add that in implementations supporting control characters other than
\#$\backslash${\clkwd Newline}, the {\clkwd read} function
is required to treat those as
whitespace characters.
\end{prop}

%----------------------------------------------------------------------
\section{Hierarchy of Types}

Providing support for extensive character repertoires may
impact Common LISP implementation performance in terms
of space, time, or both.
\footnote{This does not apply to all implementations.
Unique hardware support and user community requirements need to
be taken into consideration.}
In particular, many existing
implementations support variants of the ISO 8859/1 standard.
Supporting large
repertoires argues for a multi-byte internal representation
for each character, even if an application primarily (or exclusively)
uses the ISO 8859/1 characters.

This proposal extends the definition of the character and string
type hierarchy to allow specialized subtypes
of character and string.  An implementation is free to associate
compact internal representation tailored to each subtype.
The {\clkwd string} type specifier, when used for object
creation, for example in {\clkwd make-sequence},
is defined to mean the most general string subtype supported
by the implementation (similarly for the {\clkwd simple-string}
type specifier).  This definition emphasizes portability
of existing Common LISP applications to international
character environments over performance.  Applications emphasizing
efficiency of text processing in non-international environments
will require some modification to utilize subtypes with
compact internal representations.

It has been suggested that either a single type is
sufficient to support international characters,
or that a hierarchy of types could be used, in a manner
transparent to the user.  A desire to provide flexibility which
encourages implementations to support international
characters without compromising application efficiency
led us to accept the need for more than one type.
We believe that these choices reflect a minimal
modification of this aspect of the type system, and that
exposing the types for string and character construction while
requiring uniform treatment for characters otherwise
is the most reasonable approach.


\subsection{Character Type}

\begin{prop}
  Define {\clkwd base-character} as {\clkwd
(upgraded-array-element-type 'standard-char)}.
Characters of type {\clkwd base-character} are referred to as
{\em base characters}.  Characters of type {\clkwd
(and character (not base-character))}
are referred to as {\em extended characters}.
\end{prop}

This establishes the relationship between the string encoding and
array upgrading strategies of the implementation and
the important character types.

An implementation may support additional subtypes of {\clkwd character}
which may or may not be supertypes of {\clkwd base-character}.
In addition, an implementation may define {\clkwd base-character}
as equivalent to {\clkwd character}.
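
The definitions above might be written as follows.  This is a sketch
for an implementation that does not already provide the type
{\clkwd base-character}; the names {\clkwd extended-character} and
{\clkwd character-class} are hypothetical and are not proposed.

\begin{verbatim}
;; BASE-CHARACTER as proposed: the element type actually used by an
;; array specialized to hold the standard characters.
(deftype base-character ()
  (upgraded-array-element-type 'standard-char))

;; Extended characters are the characters that are not base.
(deftype extended-character ()
  '(and character (not base-character)))

;; Classify a character object as :BASE or :EXTENDED.
(defun character-class (c)
  (check-type c character)
  (if (typep c 'base-character) :base :extended))
\end{verbatim}

In an implementation that defines {\clkwd base-character} as equivalent
to {\clkwd character}, {\clkwd character-class} never returns
{\clkwd :extended}.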

The base characters are
distinguished in the following respects:
\begin{itemize}
\item
The standard characters are a subrepertoire of the base characters.
\item
The selection of base characters which are not standard characters
is implementation defined.
\item
Only members of the base character repertoire
can be elements of a base string.
\item
No upper bound is specified for the number of glyphs in the base
character repertoire; that
is implementation dependent.  The lower bound is 96, the
number of standard characters defined for Common LISP.
\footnote{Or, in contrast, the base repertoire may include all
implementation supported characters.}
\end{itemize}

The distinction of base characters is largely a pragmatic
choice.  It permits efficient handling of common situations, may
be privileged for host system I/O, and can serve as an
intermediate basis for portability, less general than the standard
characters, but possibly more useful across a narrower range of
implementations.

Many computers have some ``base'' character representation which
is a function of hardware instructions for dealing with characters,
as well as the organization of the file system.  The base character
representation is likely to be the smallest transaction unit permitted
for text file and terminal I/O operations.  On a system with a record
based I/O paradigm, the base character representation is likely to
be the smallest record quantum.  On many computer systems,
this representation is a byte.

However,
the proposal emphasizes that whether a character is ``base'' to
Common LISP depends on the way that an implementation represents
strings, and not any other properties of the implementation or the
host operating system.  Imagine two implementations, one of which
encodes all strings as 16-bit characters, and another which has
two kinds of strings: 8-bit strings and 16-bit strings.  In the
first implementation, the {\clkwd base-character} is
{\clkwd character}: there's only one kind of string.  In the
second implementation, the {\clkwd base-character} would be those
that could be stored in an 8-bit string, and it would be a proper
sub-type of {\clkwd character}.


\subsection{String Type}

\begin{prop}
The {\clkwd string} type
is defined as
a union type.  More precisely, a string
is a specialized vector whose elements are of type
{\clkwd character} or a subtype of {\clkwd character}.
{\clkwd string} used as a type specifier for object creation
means {\clkwd (vector character)}.
\end{prop}

\begin{prop}
The following string
subtypes are
distinguished with standardized names.
\begin{itemize}
\item {\clkwd base-string} is equivalent to {\clkwd (vector
base-character)}.
Strings of type {\clkwd base-string} are referred to as {\em base
strings}.  Strings which are not base strings are referred to
as {\em extended strings}.
\item {\clkwd general-string} is equivalent to {\clkwd (vector
character)}.
\item Both are valid as type specifiers that abbreviate.
\end{itemize}

During reader
construction of symbols, if all the characters
in the symbol's name are of type {\clkwd base-character},
then the name of the symbol may be stored as a base string.
Otherwise it will be stored as an extended string.
\end{prop}

\begin{prop}
Define {\clkwd simple-string} as a union type.
A simple
string is a specialized simple vector whose elements are of type
{\clkwd character} or a subtype of character.
{\clkwd simple-string} used as a type specifier for object creation
means {\clkwd (simple-array character ({\em size}))}.
\end{prop}

\begin{prop}
The following simple string
subtypes are
distinguished with standardized names:
\begin{itemize}
\item {\clkwd simple-base-string} is equivalent to {\clkwd
(simple-array base-character (*))}.  {\clkwd simple-base-string} is a subtype
of {\clkwd base-string}.
\item {\clkwd simple-general-string} is equivalent to {\clkwd
(simple-array character (*))}.  {\clkwd simple-general-string} is a subtype
of {\clkwd general-string}.
\item Both are valid as type specifiers that abbreviate.
\end{itemize}
\end{prop}

A base string is the most efficient string which can hold
the standard characters.
A {\clkwd general-string}
can contain any implementation supported base or extended characters,
in any mixture.

All Common LISP functions defined to operate on strings treat
base and extended strings uniformly with the following
caveat: for any function which inserts a character into a string, it
is an error to insert an extended character
into a base string.
\footnote{An implementation may, optionally, provide automatic
coercion to an extended string.}
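
Because inserting an extended character into a base string is merely
an error, a program that wishes to detect the situation portably must
check for it.  The following is one possible sketch; the function
names {\clkwd storable-in-string-p} and {\clkwd checked-set-char} are
hypothetical and are not proposed.

\begin{verbatim}
;; True when CHAR may be stored in STRING without violating the
;; element type of STRING.
(defun storable-in-string-p (char string)
  (typep char (array-element-type string)))

;; Store CHAR at INDEX, signalling an error rather than relying on
;; the "is an error" situation described above.
(defun checked-set-char (string index char)
  (unless (storable-in-string-p char string)
    (error "~S cannot be stored in a string of element type ~S."
           char (array-element-type string)))
  (setf (char string index) char))
\end{verbatim}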

An implementation may support string subtypes in addition
to {\clkwd base-string} and
{\clkwd general-string}.
For example, a hypothetical
implementation supporting Arabic and Cyrillic characters
as extended characters might provide the following string types:
\begin{itemize}
\item {\clkwd general-string} -- may contain Arabic, Cyrillic or
base characters in any mixture.
\item {\clkwd region-specialized-string} -- may contain installation
selected repertoire (Arabic/Cyrillic) or base characters in any
mixture.
\item {\clkwd base-string} -- may contain only base characters.
\end{itemize}
Though, clearly, portability of applications using
{\clkwd region-specialized-string} is limited, a performance
advantage might argue for its use.
\footnote{{\clkwd region-specialized-string} is used here for
illustration only; it is not being proposed as a standardized
string subtype.}

Alternatively,
an implementation
supporting a large base character repertoire
including, say, Japanese Kanji may define
{\clkwd base-character}
as equivalent to {\clkwd character}.

We expect that applications sensitive to the performance
of character handling in some host environments will
utilize the string subtypes to provide performance
improvement.  Applications with emphasis on international
portability will likely utilize only {\clkwd general-string}s.

The base string type allows for more compact representation of strings
of base characters, which are likely to predominate in any system.
Note that in any particular implementation the base characters
need not be the
most compactly representable, since others might have
a smaller repertoire.
However, in most implementations base strings are
likely to be more space efficient than extended strings.

\begin{prop}
Extend the {\clkwd make-string} function to allow an
{\clkwd element-type} keyword argument:
\begin{itemize}
\item {\clkwd make-string} {\em size}
{\clkwd \&key :initial-element :element-type} [Function]

This returns a simple string of length {\em size}, each
of whose characters has been initialized to the
{\clkwd :initial-element} argument.  If an {\clkwd :initial-element}
argument is not specified, then the string will be
initialized in an implementation-dependent way.  The
{\clkwd :element-type} argument names the type of the elements
of the string; a string is constructed of the most specialized
type that can accommodate elements of the given type.  If
{\clkwd :element-type} is omitted, the type {\clkwd character}
is the default.
\end{itemize}
\end{prop}
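
The following sketch illustrates the extended {\clkwd make-string}.
It assumes only the {\clkwd :element-type} keyword proposed above; the
function names are hypothetical and are not proposed.

\begin{verbatim}
;; A string specialized for the standard characters.
(defun make-compact-string (size)
  (make-string size :initial-element #\Space
                    :element-type 'standard-char))

;; A string able to hold any character (the default element type).
(defun make-general-string (size)
  (make-string size :initial-element #\Space
                    :element-type 'character))

;; True when the two type specifiers denote the same type.
(defun type-equivalent-p (type1 type2)
  (and (subtypep type1 type2) (subtypep type2 type1)))
\end{verbatim}

Under the proposal, the element type of the compact string is expected
to be equivalent to {\clkwd base-character}, and that of the general
string to {\clkwd character}.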

%----------------------------------------------------------------------
\section{Character Naming}

A Common LISP program should be able to name, compose and decompose
characters in a uniform, portable manner, independent of any
underlying representation.  One possible composition is by
the pair $<$ coded character set standard, decimal representation $>$
\footnote{This syntax is for illustration only and is not being
proposed.}.
Thus, for example, one might compose the Latin `A' with the pair
$<$ ISO8859/2-1987, 65 $>$,
$<$ ISO8859/6-1987, 65 $>$, or
$<$ ISO646-1983, 65 $>$, etc.  The difficulty here is two-fold.
First, there are several ways to compose the same character and
second, there may be multiple answers to
the question: {\em To what coded character set
does character object x belong?}\footnote{Even
worse, the answer might change yearly.}
The identical problems occur if the pair
$<$ character repertoire standard, decimal representation $>$ is used.
\footnote{Existing ISO repertoires seem to be defined exclusively
in the context of coded character sets and not as standards
in their own right.}

The concept of character registry is introduced by this proposal
to resolve the problem of character naming, composition and
decomposition.
Each character is universally defined by the
pair $<$ character registry name, character label $>$. For this
to be a portable definition, it must have a standard meaning.
Thus we propose the formation of an ISO Working Group to
define an international
{\em Character Registry Standard}.
At this writing there is no existing Character Registry Standard nor
ISO Working Group organized to define such a standard.
\footnote{It is the intention of X3 J13 to promote and adopt
an eventual ANSI or ISO Character Registry Standard.  In particular, we
acknowledge that X3 J13 is {\em not} the appropriate forum to
define the standard.  We believe
it is a required component of all programming languages
providing support for international characters.}

\begin{prop}
Common LISP character codes are composed from a character registry and
a character label.  The convention by which a character label and
character registry compose a character code is implementation
dependent.
\end{prop}

The naming and content of the standard character registries
is left unspecified by this proposal.
\footnote{The only constraint is that character registries and
labels be named using only the Latin capital letters A-Z and
digits 0-9.}
Below are some candidate character registry names:
\begin{itemize}
\item Arabic
\item Armenian
\item Bopomofo
\item Control   (meaning the collection of standard text communication
control codes)
\item Cyrillic
\item Georgian
\item Greek
\item Hangul
\item Hebrew
\item Hiragana
\item JapanesePunctuation
\item Kanji
\item Katakana
\item Latin
\item LatinPunctuation
\item Mathematical
\item Pattern
\item Phonetic
\item Technical
\end{itemize}
The list above is provided as a starting point for discussion
and is intended to be neither representative
nor exhaustive.  The Common LISP language definition does not
depend on these names or on any specific content (for example:
Where should the plus sign appear?).  It is application
programs that require a reliable definition of the
registry names and their constituents.  The Common LISP language
definition supplies only the framework for constructing and manipulating
character objects.

\begin{prop}
Standardized Character Registries are fixed;
an implementation may not extend a standard registry's
constituent set of characters beyond the
standard definition.

An implementation may provide support for all or part of any
character registry
and may provide new character registries which include characters
having unique semantics (i.e., not defined in any standard
character registry).
Implementation registries must be uniquely
named using only Latin capital letters A-Z and digits 0-9.

An implementation must document the registries it supports.
For each registry supported the documentation must include
at least the following:
\begin{itemize}
\item Character Labels,
Glyphs, and Descriptions.  Character labels must be uniquely
named using only Latin capital letters A-Z and digits 0-9.
\item Reader Canonicalization.
\footnote{Any mechanisms by which the {\clkwd read} function treats
distinct characters as equivalent.}
\item Effect of character predicates. In particular,
\begin{itemize}
\item {\clkwd alpha-char-p}
\item {\clkwd lower-case-p}
\item {\clkwd upper-case-p}
\item {\clkwd both-case-p}
\item {\clkwd graphic-char-p}
\item {\clkwd alphanumericp}
\end{itemize}
\item Interaction with File I/O.  In particular, the supported
coded character sets\footnote{For example, ISO8859/1-1987.} and
external encoding schemes
must be documented.
\end{itemize}
\end{prop}

We introduce new functions to
compose and decompose character objects.  We also extend the
{\clkwd characterp} predicate to
support testing
membership of a character in a given character repertoire.
\footnote{
For example,
testing membership in the Japanese Katakana character repertoire.
}
A global variable {\clkwd *all-character-registry-names*}
is added so that applications can determine which
character registries the implementation supports.

\begin{prop}
Add the type specifier and (modified) type predicate:
\begin{itemize}
\item {\clkwd (character {\em repertoire})}

This denotes a character type specialized to members of the
specified repertoire.  {\em Repertoire} may be {\clkwd :base}
or {\clkwd :standard} or any supported character repertoire
name (a keyword symbol), or a list of names.

{\clkwd (character :base)} is equivalent to {\clkwd base-character},
and
{\clkwd (character :standard)} is equivalent to {\clkwd standard-char}.
\item {\clkwd (characterp {\em object} \&optional
{\em repertoire})}

If {\em repertoire} is omitted, {\clkwd characterp} is true if
{\em object} is a character object, and otherwise is false.  If
a {\em repertoire} argument is specified, {\clkwd characterp}
is true if {\em object} is a character object and a member
of the specified repertoire, and otherwise is false.  {\em Repertoire}
may be any supported character repertoire name (a keyword symbol)
or the names {\clkwd :base} or {\clkwd :standard}.
{\clkwd (characterp x :standard)} is equivalent to
{\clkwd (standard-char-p x)}.
{\clkwd (characterp x :base)} is true if x is a member of the
base character repertoire.


\end{itemize}

\end{prop}
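
As an illustration of the extended predicate, the following is a
minimal sketch in portable Common LISP.  It is a toy model and not
part of the proposal: the name {\clkwd characterp*}, the table
{\clkwd *toy-repertoires*}, and the {\clkwd :digits} repertoire are
hypothetical, and returning false for an unsupported repertoire name
is an assumption of the sketch.

\begin{verbatim}
;; Toy model: repertoire name -> membership predicate.
;; :DIGITS is an invented repertoire used only for this sketch.
(defvar *toy-repertoires*
  (list (cons :standard #'standard-char-p)
        (cons :digits (lambda (c) (and (digit-char-p c) t)))))

(defun characterp* (object &optional (repertoire nil repertoire-p))
  "Hypothetical stand-in for the extended CHARACTERP."
  (and (characterp object)
       (or (not repertoire-p)
           (let ((test (cdr (assoc repertoire *toy-repertoires*))))
             ;; Assumption: an unsupported repertoire yields false.
             (and test (funcall test object) t)))))

(characterp* #\a)            ; true: any character object
(characterp* #\a :standard)  ; true: same as (standard-char-p #\a)
(characterp* #\a :digits)    ; false: not in the toy repertoire
(characterp* 42 :standard)   ; false: not a character object
\end{verbatim}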

\begin{prop}
Add the following variable and functions:
\begin{itemize}
\item {\clkwd *all-character-registry-names*} {\em  [Variable]}

The value of {\clkwd *all-character-registry-names*} is a list
of the names (keyword symbols) of all character registries supported by
the implementation.
\item {\clkwd char-label} {\em char [Function]}

{\clkwd char-label} returns a string representing the character
label of {\em char}.  It is an error if the argument is
not a character object.
\item {\clkwd char-registry-name} {\em char [Function]}

{\clkwd char-registry-name} returns a string representing the character
registry to which {\em char} belongs. It is an error if the
argument is not a character object.
\item {\clkwd find-char} {\em registry label [Function]}

{\clkwd find-char} returns a character object.  The arguments
{\em registry} and {\em label} are names (keyword symbols) of
a character registry and label.  {\em label} uniquely
identifies a character within the character registry named
{\em registry}.  If the implementation does not support the
specified character, {\clkwd nil} is returned.
\end{itemize}

\end{prop}
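
The composition and decomposition functions above can be modeled in
a few lines of portable Common LISP.  This is a sketch under stated
assumptions, not a proposed implementation: the starred names and the
table {\clkwd *toy-registries*} are hypothetical, the registry
contents and labels are invented for the example, and returning
{\clkwd nil} for a character found in no registry is a choice of the
sketch.

\begin{verbatim}
;; Toy model: registry name -> alist of (label . character).
;; Contents are invented; no standard registry is implied.
(defvar *toy-registries*
  '((:latin (:a . #\A) (:b . #\B))
    (:latinpunctuation (:plus . #\+))))

(defun find-char* (registry label)
  "Return the character named by REGISTRY and LABEL, or NIL."
  (cdr (assoc label (cdr (assoc registry *toy-registries*)))))

(defun char-registry-name* (char)
  "Return the registry name of CHAR as a string, or NIL."
  (dolist (registry *toy-registries*)
    (when (rassoc char (cdr registry))
      (return (symbol-name (car registry))))))

(defun char-label* (char)
  "Return the label of CHAR as a string, or NIL."
  (dolist (registry *toy-registries*)
    (let ((hit (rassoc char (cdr registry))))
      (when hit
        (return (symbol-name (car hit)))))))

(find-char* :latin :a)        ; the character A
(find-char* :kanji :a)        ; NIL: registry not in the toy table
(char-registry-name* #\+)     ; "LATINPUNCTUATION"
(char-label* #\+)             ; "PLUS"
\end{verbatim}

Composing a character and then decomposing it recovers the original
pair of names, which is the round-trip property the registry concept
is meant to guarantee.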

\begin{prop}
Character
names accepted and constructed by {\clkwd char-name}, {\clkwd name-char},
and {\clkwd read} are extended to include character registry names of
the form {\em registry:label}.
\end{prop}
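
The proposal does not specify how a name of the form
{\em registry:label} is to be parsed.  As one possible reading, the
sketch below splits such a name at its first colon; the function name
{\clkwd split-registry-name} is hypothetical, and treating a name
without a colon as a bare label is an assumption.

\begin{verbatim}
(defun split-registry-name (name)
  "Split the string NAME at its first colon.
Returns two values: the registry part (or NIL) and the label part."
  (let ((colon (position #\: name)))
    (if colon
        (values (subseq name 0 colon) (subseq name (1+ colon)))
        (values nil name))))

(split-registry-name "LATIN:A")  ; "LATIN" and "A"
(split-registry-name "Newline")  ; NIL and "Newline"
\end{verbatim}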

%----------------------------------------------------------------------
\section{Streams and System I/O}

Much of the work of ensuring that a
Common LISP implementation operates correctly in a
multiple coded character set environment must be performed by
the I/O interface.
The system I/O interface, abstracted in
Common LISP as streams, is responsible
for ensuring that text input from outside LISP is properly mapped
into character objects internally, and that the inverse mapping
is performed on output.  It is beyond the scope of a language
definition to specify the details of this operation, but options
are specified which allow runtime indication from the user as to
what coded character sets a stream uses, and how the mappings
should be done.  It is expected that implementations will provide
reasonable defaults and invocation options to accommodate desired use
at an installation.

A computer can often support multiple
coded character sets, through the use of special display and entry
hardware; these are
varying interpretations of the basic system character
representation.  For example, ISO 8859/1 and ISO 6937/2 are two
different interpretations of the same 1-byte code representations.
Many countries have their own glyph-to-code mappings for 1-byte
character codes addressing the special requirements of national
languages.  Differentiating between these, without reference to
display hardware, is a matter of convention, since they all use the
same set of code representations.  When a single byte is not enough,
two or more bytes are sometimes used for character encoding.  This
makes character handling even more difficult on machines where the
natural representation size is a byte, since not only is the semantic
value of a character code a matter of convention, which may vary
within the same computing system, but so is the identification of a
set of bits as a complete character code.

Given that multiple coded character sets exist, it is useful
to provide portable mechanisms based on their definitions.

\begin{prop}
Add the following functions:
\begin{itemize}
\item {\clkwd char-external-code} {\em char name [Function]}

{\clkwd char-external-code} returns the non-negative integer
representing the encoding of the character {\em char} in the
coded character set named by {\em name}, a keyword symbol.  If
the implementation does not support the specified coded
character set, {\clkwd nil} is returned.  If the named
coded character set does not contain the character,
{\clkwd nil} is returned.
\item {\clkwd find-external-char} {\em name index [Function]}

{\clkwd find-external-char} returns a character object.
The argument {\em index} is a non-negative integer
representing the encoding of a character in the
coded character set named by {\em name}, a keyword symbol.  If
the implementation does not support the specified coded
character set, {\clkwd nil} is returned.  If the named
coded character set does not contain the character,
{\clkwd nil} is returned.
\end{itemize}
\end{prop}
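
The two functions above are inverse lookups in a table keyed by coded
character set.  The following sketch models them in portable Common
LISP.  The starred names and the table {\clkwd *toy-coded-sets*} are
hypothetical, the name {\clkwd :ISO646V1983} is invented by analogy
with the naming examples in this document, and only a few entries are
listed.

\begin{verbatim}
;; Toy model: coded character set name -> alist of (character . code).
(defvar *toy-coded-sets*
  '((:iso646v1983    (#\A . 65) (#\B . 66) (#\+ . 43))
    (:iso8859p1v1987 (#\A . 65) (#\B . 66))))

(defun char-external-code* (char name)
  "Return the code of CHAR in the coded set NAME, or NIL."
  (cdr (assoc char (cdr (assoc name *toy-coded-sets*)))))

(defun find-external-char* (name index)
  "Return the character with code INDEX in the coded set NAME, or NIL."
  (car (rassoc index (cdr (assoc name *toy-coded-sets*)))))

(char-external-code* #\A :iso646v1983)     ; 65
(char-external-code* #\+ :iso8859p1v1987)  ; NIL: not in the toy table
(find-external-char* :iso646v1983 43)      ; the character +
(find-external-char* :jis 65)              ; NIL: unsupported set
\end{verbatim}

Both failure cases, an unsupported coded character set and a
character missing from a supported set, yield {\clkwd nil}, as the
proposal requires.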

An implementation supporting multiple coded character sets
must allow for the external
representation of characters to be separately (and perhaps
multiply) specified to {\clkwd open},
since there can be circumstances under
which more than one external representation for characters
is in use, or more than one coded character set
is mixed together in an
external representation convention.

Which coded character sets and encoding schemes
are supported by the overall computing system and the
details of the mapping of glyphs to characters
to character codes are
left unspecified by Common LISP.


\begin{prop}
Add the additional keyword argument to {\clkwd open}:
\begin{itemize}
\item {\clkwd :external-code}
which
specifies a name, or list of names (keyword symbols),
indicating an implementation-recognized scheme for
representing one or more coded character sets with non-homogeneous codes.

The default value is {\clkwd :default}; its meaning is
implementation-defined, but it must include the
base characters.

As many coded character set names must be provided as the
implementation requires for that external coding convention.

Coded character set names must
include the full reference number and approval year, for example
{\clkwd :ISO8859P1V1987} and {\clkwd :ISO6937P2V1983}.
The names of all implementation-recognized schemes are formed from
the Latin uppercase letters A-Z and the digits 0-9.
\end{itemize}

This argument is provided for input, output, and
bidirectional streams.
It is an error to try to write a character other than a
member of the specified coded character sets
to a stream.  (This excludes the
\#$\backslash${\clkwd Newline} character.
Implementations must provide appropriate line division behavior
for all character streams.)
\end{prop}
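
The naming constraint used throughout this proposal (Latin capital
letters A-Z and digits 0-9 only) can be checked mechanically.  The
sketch below is one way to do so; the function name
{\clkwd valid-scheme-name-p} is hypothetical.  It tests membership in
an explicit string rather than comparing against character ranges,
since the language does not guarantee that the letters are contiguous
in the character ordering.

\begin{verbatim}
(defun valid-scheme-name-p (name)
  "True if NAME (a symbol or string) is non-empty and uses only
the characters A-Z and 0-9."
  (let ((text (string name)))
    (and (plusp (length text))
         (every (lambda (c)
                  (find c "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))
                text)
         t)))

(valid-scheme-name-p :ISO8859P1V1987)  ; true
(valid-scheme-name-p "ISO8859/1-1987") ; false: contains / and -
(valid-scheme-name-p "")               ; false: empty name
\end{verbatim}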

The existing default for the {\clkwd :element-type} argument of
{\clkwd open} is {\clkwd string-char}.  This is no longer appropriate
given the diminished use of {\clkwd string-char} within the
standard specification.

\begin{prop}
Modify the {\clkwd :element-type} argument to {\clkwd open} as follows:
\begin{itemize}
\item Add {\clkwd base-character} as a valid type.
\item Remove {\clkwd string-char} as a valid type.
\end{itemize}
\end{prop}

The following alternative is consistent with the general
premise that portability is emphasized over efficiency.

\begin{prop} (Alternative A)
The default for the {\clkwd :element-type} argument of {\clkwd open}
is {\clkwd character}.
\end{prop}

The following alternative (B) allows implementations to match
the behavior of {\clkwd open} to the expected behavior of
their file systems.

\begin{prop} (Alternative B)
The default for the {\clkwd :element-type} argument of {\clkwd open}
is implementation defined as either {\clkwd base-character}
or {\clkwd character}.
\end{prop}


\begin{prop}
Modify the following functions:
\begin{itemize}
\item {\clkwd with-output-to-string}, if no string argument is
provided, produces a stream that accepts all characters and returns
a string of the most specialized type
that accommodates the characters that were actually output.
\item {\clkwd make-string-output-stream}
produces a stream that accepts all characters and returns
(via {\clkwd get-output-stream-string})
a string of the most specialized type
that accommodates the characters that were actually output.
\end{itemize}
\end{prop}
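
A program can observe the type of string that a string output stream
actually returned by examining its element type.  The sketch below
uses only existing Common LISP functions; the helper name
{\clkwd output-string-and-type} is hypothetical, and the element type
reported depends on the implementation, so no particular value is
shown.

\begin{verbatim}
(defun output-string-and-type (text)
  "Collect TEXT through a string output stream.
Returns two values: the resulting string and its element type."
  (let ((result (with-output-to-string (out)
                  (write-string text out))))
    (values result (array-element-type result))))

(output-string-and-type "abc")  ; first value is "abc"
\end{verbatim}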


In addition to supporting conversion at the system interface, the
language must allow user programs to determine how much space data
objects will require when output in whichever external representations
are available.

Such a function is necessary
to determine whether strings can be written to fixed-length
fields in databases.  Note that the function proposed here
does not
address the problem of calculating
the screen width of strings printed in proportional fonts.

\begin{prop}
Add the following function:
\begin{itemize}
\item {\clkwd string-encoded-length} {\em object}
{\clkwd \&optional} {\em output-stream [Function]}

{\clkwd string-encoded-length} returns the number of
implementation-defined units required for the object on the
output stream.  If not applicable to the output stream, the
function returns {\clkwd nil}.  This number
corresponds to the current state of the stream and may change if
there has been intervening output.  If the
output stream is not specified, {\clkwd *standard-output*} is
the default.
\end{itemize}
\end{prop}
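
To make the fixed-length field use case concrete, the sketch below
computes an encoded length under an invented convention in which each
standard character occupies one unit and every other character
occupies two.  The names {\clkwd toy-encoded-length} and
{\clkwd fits-in-field-p} are hypothetical, the two-unit rule is an
assumption, and, unlike the proposed function, the sketch takes no
stream argument and so ignores stream state.

\begin{verbatim}
(defun toy-encoded-length (string)
  "Units needed for STRING under the toy convention:
1 unit per standard character, 2 units per other character."
  (reduce #'+ string
          :key (lambda (c) (if (standard-char-p c) 1 2))))

(defun fits-in-field-p (string field-width)
  "True if STRING fits in a field of FIELD-WIDTH units."
  (<= (toy-encoded-length string) field-width))

(toy-encoded-length "abc")   ; 3
(toy-encoded-length "")      ; 0
(fits-in-field-p "abc" 2)    ; false: needs 3 units
\end{verbatim}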



%----------------------------------------------------------------------
\section{Miscellaneous}

In the process of creating this document, some comments were found
within CLtL which seem appropriate to modify independently of
the other proposals mentioned previously.  For each, we identify
the existing statement of CLtL and the recommended change.

%----------------------------------------------------------------------
%----------------------------------------------------------------------

\newcommand{\edithead}{\begin{tabular}{l p{3.95in}}
  \multicolumn{2}{l} }

\newcommand{\csdag}{\bf$\Rightarrow$\ddag}

\newcommand{\editstart}{}

\newcommand{\editend}{\\ & \end{tabular}}

%----------------------------------------------------------------------
%----------------------------------------------------------------------

%----------------------------------------------------------------------

\begin{prop}

\edithead {\csdag (p12) Chapter 2 Data Types}
\editstart
\\ \bf replace &
\cltxt
   provides for a
   rich character set, including ways to represent characters of various
   type styles.
\\ \bf with &
\cltxt
   provides support for international language characters as well
   as characters used in specialized arenas, e.g., mathematics.
\editend
\end{prop}


\begin{prop}

\edithead {\csdag (p25) Chapter 2 Symbols}
\editstart
\\ \bf replace &
\cltxt
  A symbol may have uppercase letters, lowercase letters, or
  both in its print name.
\\ \bf with &
\cltxt
  A symbol may have characters from any supported character
  repertoire (except control characters) in its print name.
\editend
\end{prop}

\begin{prop}

\edithead {\csdag (p163) Chapter 10 Symbols}
\editstart
\\ \bf replace &
\cltxt
  It is ordinarily not permitted to alter a symbol's print name.
\\ \bf with &
\cltxt
  It is an error to alter a symbol's print name.
\editend
\end{prop}

\begin{prop}

\edithead {\csdag (p168) Chapter 10 The Print Name}
\editstart
\\ \bf replace &
\cltxt
  It is an extremely bad idea to modify a string being used
  as the print name of a symbol.
\\ \bf with &
\cltxt
  It is an error to modify a string being used
  as the print name of a symbol.
\editend
\end{prop}


\begin{prop}

\edithead {\csdag (p249,make-sequence) Chapter 14 Simple Sequence
Functions}
\editstart
\\ \bf append &
\cltxt
  If type {\clkwd string} is specified, the result is
  equivalent to {\clkwd make-string}.
\editend
\end{prop}

%----------------------------------------------------------------------
%----------------------------------------------------------------------
%----------------------------------------------------------------------
\begin{thebibliography}{wwwwwwww 99}


\bibitem[Ida87]{ida87} M. Ida, et al.,
{\em
JEIDA Common LISP Committee Proposal on Embedding Multi-Byte Characters
},
ANSI X3J13 document 87-022 (1987).

\bibitem[ISO 646]{iso646} ISO,
{\em
Information processing -- ISO 7-bit coded character set
for information interchange
},
ISO (1983).

\bibitem[ISO 4873]{iso4873} ISO,
{\em
Information processing -- ISO 8-bit code for information
interchange -- Structure and rules for implementation
},
ISO (1986).

\bibitem[ISO 6937/1]{iso6937/1} ISO,
{\em
Information processing -- Coded character sets for text
communication -- Part 1: General introduction
},
ISO (1983).

\bibitem[ISO 6937/2]{iso6937/2} ISO,
{\em
Information processing -- Coded character sets for text
communication -- Part 2: Latin alphabetic and non-alphabetic
graphic characters
},
ISO (1983).

\bibitem[ISO 8859/1]{iso8859/1} ISO,
{\em
Information processing -- 8-bit single-byte coded
graphic character sets -- Part 1: Latin alphabet No. 1
},
ISO (1987).

\bibitem[ISO 8859/2]{iso8859/2} ISO,
{\em
Information processing -- 8-bit single-byte coded
graphic character sets -- Part 2: Latin alphabet No. 2
},
ISO (1987).

\bibitem[ISO 8859/6]{iso8859/6} ISO,
{\em
Information processing -- 8-bit single-byte coded
graphic character sets -- Part 6: Latin/Arabic alphabet
},
ISO (1987).

\bibitem[ISO 8859/7]{iso8859/7} ISO,
{\em
Information processing -- 8-bit single-byte coded
graphic character sets -- Part 7: Latin/Greek alphabet
},
ISO (1987).

\bibitem[Kerns87]{kerns87} R. Kerns,
{\em
Extended Characters in Common LISP
},
X3J13 Character Subcommittee document, Symbolics, Inc. (1987).

\bibitem[Kurokawa88]{kurokawa88} T. Kurokawa, et al.,
{\em
Technical Issues on International Character Set Handling in Lisp
},
ISO/IEC SC22 WG16 document N33 (1988).

\bibitem[Linden87]{linden87} T. Linden,
{\em
Common LISP - Proposed Extensions for International Character Set
Handling
},
Version 01.11.87, IBM Corporation (1987).

\bibitem[Steele84]{steele84} G. Steele Jr.,
{\em
Common LISP: The Language
},
Digital Press (1984).

\bibitem[Xerox87]{xerox87} Xerox,
{\em
Character Code Standard, Xerox System Integration Standard
},
Xerox Corp. (1987).

\end{thebibliography}

\end{document}             % End of document.